Data Visualization: Day 1

Erik Westlund

2025-06-02

Welcome to Data Visualization

Housekeeping

  • Who am I?
  • Contact information
  • Course materials

Who am I?

  • I am a data scientist at the Johns Hopkins Bloomberg School of Public Health
  • I work out of the Johns Hopkins Biostatistics Center in the Department of Biostatistics
  • I was trained in the social sciences and have worked professionally as a data scientist and software developer for over 10 years

Contact information

  • Email: ewestlund@jhu.edu

Course materials

Course Goals

  • Understand core data visualization concepts
  • Develop strong data science & data visualization workflows
  • Learn to produce high-quality data visualizations
  • Learn to communicate effectively with and about data visualizations

Course Outline

  • Introduction to data visualization
  • Tooling & workflow
  • Data preparation
  • Grammar of graphics
  • Making good, honest graphics
  • Dashboards

Introduction to Data Visualization

Edward Tufte: Graphical Excellence Is…

Edward Tufte

“Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design…”

Edward Tufte, The Visual Display of Quantitative Information, 1983

Dense

“It is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space…”

Edward Tufte, The Visual Display of Quantitative Information, 1983

Multivariate

“It is nearly always multivariate…”

Edward Tufte, The Visual Display of Quantitative Information, 1983

Truthful

“Graphical excellence requires telling the truth about the data…”

Edward Tufte, The Visual Display of Quantitative Information, 1983

Exemplar: Napoleon’s March

Charles Minard’s Napoleon’s March

Achieving Minard’s Graphical Excellence

“[Minard’s classic image] can be described and admired, but there are no compositional principles on how to create that one wonderful graphic in a million.”

Edward Tufte, The Visual Display of Quantitative Information, 1983

For The Rest of Us

Instead, for “more routine, workaday designs,” Tufte suggests:

  • “[Have] a properly chosen format and design”
  • “Use words, numbers, and drawing together”
  • “[D]isplay an accessible complexity of detail”
  • “Avoid content-free decoration, including chartjunk”

We will revisit more of Tufte’s principles throughout the course.

Tooling & Workflow

  • It is worth investing in learning your tools
  • A good data visualization workflow depends on good tooling
  • Below we will discuss some of the tools we will use in this course and why we use them

Required Software For This Course

R

  • We will rely mostly on R for this course
  • R can be downloaded from r-project.org

ggplot2

  • ggplot2 is a powerful package for creating data visualizations
  • It is built on the grammar of graphics
  • It is a declarative grammar for data visualization

git

  • git is a powerful tool for version control
  • It allows you to
    • track changes to your code
    • revert to previous versions of your code
    • collaborate with others on your code
    • maintain multiple branches/versions of your code
    • and more

GitHub

  • git is not GitHub
  • GitHub is a web-based platform for hosting and collaborating on code
  • Technically, it hosts remote git repositories
  • It gives you a place to store your code and collaborate with others
  • It is free for open source projects

Scientific Notebooks

  • Notebooks are a powerful way to work with data and do data visualization
  • They allow you to embed code, text, and visualizations in a single document
  • They thus allow you to easily share both the process and the results of your work
  • I do not require a specific notebook system for this course, but I will be using Quarto for examples

Notebooks: Quarto

  • Quarto is an open source scientific and technical publishing system
  • You can create reports, websites, presentations, and books with Quarto
  • This presentation is built with Quarto
  • You can embed Python, R, and other code in your Quarto documents
  • Quarto renders to a document in HTML, PDF, or Word format; the source files themselves are plain text
  • Easy to store notebooks in version control with git

RMarkdown

  • RMarkdown is a way to create documents that mix R code and text
  • It integrates with RStudio well and has a very similar workflow to Quarto
  • RMarkdown renders to a document in HTML, PDF, or Word format; the source files themselves are plain text
  • Easy to store notebooks in version control with git

Jupyter

  • Jupyter is a notebook system popular with Python users
  • Jupyter stores code and results in the same document (Quarto/RMarkdown render into a separate document)
  • Jupyter supports R and other languages
  • Jupyter notebooks are stored as JSON (JavaScript Object Notation) files, which are not as easy to diff in git
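
To make the diffing point concrete, here is a minimal sketch using only the Python standard library. It assumes a simplified, notebook-like JSON structure (not a complete, valid Jupyter notebook), and the helper names make_notebook and diff_lines are hypothetical. It applies the same one-line code change to a plain-text source and to the notebook-style JSON, then counts the changed lines a diff would show.

```python
import difflib
import json


def make_notebook(source, output_text, execution_count):
    """Build a simplified notebook-like dict (an assumption, not the full
    Jupyter schema): one code cell stored together with its saved output."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {"name": "stdout", "output_type": "stream", "text": [output_text]}
                ],
                "source": [source],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def diff_lines(before, after):
    """Return only the added/removed lines of a unified diff of two texts."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# The same edit (2 + 2 becomes 2 + 3), stored two ways.
plain_before = "The answer is:\nprint(2 + 2)\n"
plain_after = "The answer is:\nprint(2 + 3)\n"

# Re-running the cell also changes the saved output and the execution count.
nb_before = json.dumps(make_notebook("print(2 + 2)", "4\n", 1), indent=1, sort_keys=True)
nb_after = json.dumps(make_notebook("print(2 + 3)", "5\n", 2), indent=1, sort_keys=True)

plain_changes = diff_lines(plain_before, plain_after)
notebook_changes = diff_lines(nb_before, nb_after)

print(len(plain_changes), "changed lines in the plain-text source")
print(len(notebook_changes), "changed lines in the notebook-style JSON")
```

The notebook-style diff is larger because the stored output and execution counter change along with the code. Real notebooks with plots embed image data as well, which makes the effect much stronger.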

Optional/Popular Software

RStudio

  • RStudio is a powerful IDE for R
  • It is free and open source
  • It helps you understand what is in your environment (e.g., variables, functions, packages, etc.)
  • It also makes it easy to view your visualizations as you make them

Python

  • Python is a powerful general purpose programming language
  • It is very popular in the data science community, especially in machine learning

LLMs

  • LLMs are commonly used to help with code
  • Common ones used in data science are ChatGPT, Claude, Gemini, and GitHub Copilot
  • They can help you write code, debug code, and write documentation
  • They can also make mistakes, so you cannot blindly trust their work

AI, LLMs, and Data Visualization

AI and Data Visualization

  • AI and LLMs are becoming more and more powerful
  • They can help you with many data-related tasks, but require care
  • They are allowed in this course, but you are responsible for checking their work
  • Data visualization is a scientific area where, to an extent, if things “look right” they are probably right

My Philosophy

  • I use LLMs in nearly all aspects of my work
  • I have found that there is now less value in being able to “make computer do something” and more value in understanding high-level concepts
  • To that end, this course will focus a little more on concepts and a little less on ggplot2 syntax, since LLMs can now solve most technical visualization problems
  • Let’s try it out.

Python test

4